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Abstract 

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bio- 
informatics. With the uninitiated in mind, v^e provide an accessible exposition of the SA-IS algorithm, v^^hich is the 
state of the art in suffix array construction. We also describe DisLex, a technique that allov^s standard suffix array 
construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching 
needed to support 'spaced seeds' and 'subset seeds' used in many biological applications. 
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INTRODUCTION 

The problem of finding the occurrences of a pattern 
string in a given text is one of the most fundamental 
computational tasks in bioinfomiatics. In most bio- 
informatics applications, the text is a huge database 
onto which a large volume of pattern queries are 
thrown. In such cases, precomputing an indexed 
data structure of the text allows efficient processing 
of pattern searches. 

One simple and effective data structure is a sufHx 
array, which informally is a list of the starting pos- 
itions of the sufBxes of the text, sorted by their al- 
phabetical order. Suffix arrays are easy to understand 
and implement and form the basis for a host of other 
sophisticated indexing techniques. 

SufSx arrays are related to a slightly more complex 
data structure known as a sufiix tree. Both sufHx 
arrays and suffix trees afford time-efEcient solutions 
to problems of searching for substrings in a text 
as well as a variety of other related problems. 
Historically, suffix trees received much attention be- 
cause time-efficient algorithms for their construction 



and use were developed early [1]. In bioinformatics, 
several sufSx tree-based applications (e.g. [2]) were 
developed as well as an influential textbook that lar- 
gely focused on them [3]. However, suffix trees 
suffer from a relatively large memory requirement 
and did not gain widespread popularity. One careful 
implementation [4] of suffix trees requires 20 bytes 
per input character in the worst case and in practice, 
an average of 12.55 bytes per input character for 
DNA sequences. In contrast, a sufEx array in its sim- 
plest form only requires 4 bytes per character (for 
text size < 2^^). This may not be a fair comparison, 
as a full-fledged sufiix tree is more powerful than a 
basic sufEx array in the sense that it can be used to 
solve more complex problems. Fortunately, subse- 
quent advances in theory revealed that suffix arrays 
supplemented with additional tables can substitute 
for suffix trees [5] and, as we describe here, can be 
directly constructed in linear time. 

The popularity of sufBx arrays in bioinformatics 
is evident from their application in a range of tasks 
such as pairwise sequence alignment [6—9], error 
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correction of reads from high-throughput sequencers 
[10, 11], prefix— sufEx match finding for genome as- 
sembly [12, 13], fe-mer counting [14] and sequence 
clustering [15], as well as the development of suffix 
array software expUcitly aimed at bioinfomiatics ap- 
plications [16]. 

A key requirement of any indexing method is that 
it be constructible in a time- and memory-eiEcient 
manner. Progress in the quest for an eflicient suffix 
array construction algorithm started in 1993 with 
Manber and Myers [17] who applied a prefix dou- 
bling technique for repeat detection [18] to suffix 
array construction, obtaining an O(ttlogtt) time al- 
gorithm for an input text of size n. A major break- 
through was achieved a decade later with the almost 
concurrent discovery of three different hnear-time 
algorithms by Kim et al. [19], Karkkainen and 
Sanders [20] and Ko and Aluru [21]. We will not 
attempt to recount this long history — but instead 
refer the interested readers to a thorough survey of 
results up to 2007 by Puglisi et al. [22]. Instead, we 
focus only on linear-time algorithms, and in particu- 
lar on a recent algorithm called SA-IS proposed by 
Nong etal. [23, 24]. SA-IS, which builds on previous 
work [21, 25] and their own new ideas, is a beautiful 
and practical linear-time algorithm. It is among the 
fastest algorithms available at the time of this writing, 
and it is also the basis for recent developments in 
algorithms that simultaneously optimize both time 
and memory usage [26]. The main goal of this article 
is to explain SA-IS in a way which can be understood 
by anyone having a basic background in algorithms. 
We describe SA-IS in section 'Suffix array construc- 
tion', then discuss and demonstrate the time and 
memory performance of SA-IS with some simple ex- 
periments in section 'Computational Complexity'. 

With biological sequences, the requirement that 
patterns match 'exactly' can sometimes be too 
strict; rather the search is for regions in the text 
that approximately match the query. The definition 
of an approximate match depends on the application 
at hand, and it determines the feasibility of extending 
suffix arrays to handle such queries. By a straightfor- 
ward modification of the lexical ordering of suffixes, 
suffix arrays can directly support 'subset' matching. 
Subset matching allows matching to ignore differ- 
ences between some or all characters in a predefined 
position-specific way. For example, it is possible to 
construct a modified suffix array that affords efficient 
search for all suffixes matching (a prefix of) the pat- 
tern '[ga] . . c', i.e. any occurrence of g or a 



followed by a c three positions later. Fortunately, 
as we describe in section 'Inexact pattern matching', 
suffix arrays defined under this kind of modified lex- 
ical ordering can be constructed in essentially the 
same way as conventional suffix arrays [27]. 



PRELIMINARIES 

Mathematical definitions can be an unpleasant sight; 
nonetheless, we require a set of definitions and no- 
tations that we will use throughout this text. We will 
present them in this section. We wiU also use this 
section to formally introduce suffix arrays and briefly 
describe their classic apphcation: efficient search of 
exact matches to substrings in a text. 

Definitions and notations 

Let text T be a string of characters T()---T„_i, 
where T,- denotes the i'th character of T. The char- 
acters T(),...,T„_2 are members of a predefined set of 
characters called the 'alphabet', whereas the end 
character T„_i is a 'sentinel' character (denoted $) 
not in the alphabet. For suffix trees, the sentinel is 
essential for its role in ensuring that no suffix is a 
prefix of another. They are not absolutely necessary 
in the discussion of suffix arrays, but are required by 
some of the construction algorithms. In bioinfor- 
matics, the alphabet is usually fixed and relatively 
small. For example, with DNA strings, the alphabet 
usually encountered is {a,c,g,t,n}, where n is used 
at positions where the base has not been confidently 
identified. The lexical ordering between characters 
in the alphabet (and therefore for any two strings) is 
taken to be the same as they would have appeared in 
a dictionary — except for one extra rule that the sen- 
tinel character is defined to be lexically smaller than 
any other character of the alphabet, or equivalently 
that if suffix r is a proper prefix of suffix s, r comes 
before 5 (This is the convention used in the algo- 
rithm literature. In practice some software packages 
adopt the opposite convention, with the sentinel 
character sorting last.). When applied to strings, we 
use the symbols <, > and = to denote lexical com- 
parison. The 'size' or 'length' of T is the number of 
characters in T and is denoted by \ T\. Let T, y (/ < j) 
denote the length j — 1 substring of T starting at 
Tj and ending at Ty. Let T, denote the 'suffix' 
T;. „_i of T. The 'suffix array' of T is the lexically 
ordered list of its suffixes. Of course, the suffix array 
does not hold the actual suffixes, but just the index of 
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0 1 2 3 4 5 6 7 8 9 10 11 12 13 

IMS |t |g IMS |t |g |c |a |c |c |g^ 

0 13 $ 

1 9 ac c g$ 

2 8 caccgS 

3 10 C C g $ 

4 TT C g $ 

5 IF g$ 

6 7 gcaccg$ 

7 ~5~gtgcaccg$ 

8 3 gtgtgcaccgS 

9 1 gtgtgtgcaccgS 

10 6 tgcaccg$ 

11 4 tgtgcaccg$ 

12 ~tgtgtgcaccg$ 

13 [T] tgtgtgtgcaccg$ 

Figure I: A string (above) and its suffix array (shown 
vertically) along with the position index on the left and 
the corresponding suffixes to the right. 

the starting position of each suffix. An example text 
with its sufiix array is shown in Figure 1. 

Query string search 

Given a text T along with its suffix array, which we 
denote here by SA , and a query string P, we can 
efficiently locate the occuiTences of P in T. The 
most straightforward way is binary search. Consider 
the starting positions of each match of P in the text; 
as SA'^ is sorted by suffix lexical order, all suffixes 
starting with P must be in one contiguous block in 
SA'^ . For example, in Figure 1 , the starting indices of 
suffixes with prefix gtg are aU clustered within pos- 
itions 7—9 of the suffix array. Thus, the search for 
P entails finding the two boundaries of this block, i.e. 
the smallest and largest values of / such that P is a 
prefix of the suffix starting at text position S^^[f]. 
Because the suffixes are in lexically sorted order in 
SA\ the two boundaries (or their absence if P does 
not appear anywhere in T) can be computed by two 
rounds of binary searching. 

How fast is this search? The size of SA is equal to 
the size of T, so a binary search on it requires 
0(log I T|) steps. At each step at most |P| characters 
need to be compared. Therefore, the time complex- 
ity of this search method (excluding the time to enu- 
merate all the occurrences) is 0(|P| log | T|). If | T| is 
relatively large, the multiplicative log | T| factor 
might be costly (for example for the human 
genome, log(|T|) is ~32). 

There are several ways to speed up the search op- 
eration, but they come at the cost of memory. A 
simple method is to cut down on the number of 
steps required for a binary search by constructing a 
look-up table that associates a set of fe-mers with the 



positions in the suffix aiTay where they first appear as 
a prefix. Although large values of k are prohibitive, 
this method allows some flexibility to balance the 
trade-off between search time and memory usage 
by selecting an appropriate value of k. 

More sophisticated methods also exist. Manber 
and Myers [17] show that precomputing the length 
of the longest common prefix (LCP) for certain 
pairs of suffixes can reduce the search time to 
0(|P| + log I T|). An LCP array stores for each pair 
of successive suffixes in a suffix array, the length of 
the LCP between them. Abouelhoda et al. [5] show 
that using an additional table alongside the suffix 
array and LCP array can bring the time further 
down to 0(|P|), completely removing the depend- 
ency on the text size. These methods are attractive 
because they give meaningful worst case performance 
guarantees. However, they do require at minimum a 
few bytes of memory overhead per text character, 
which can be a practical problem for bioinformatic 
applications (section 'Computational Complexity'). 



SUFFIX ARRAY CONSTRUCTION 

With a basic understanding of suffix arrays under out 
belts, we move on to the topic of how to construct 
them. Given text T, a simple way to build its suffix 
array is to sort the suffixes of T using a general string 
sorting algorithm such as radix sort [28]. This is 
simple and incurs very little memory overhead for 
the construction, but its worst case running time is 
quadratic in the length of the string. Still it is quite 
fast when the input string does not contain many 
repeated long substrings. One implementation [6] 
based on radix sort constructs a suffix array of the 
human genome in ~20 min using a decent modern- 
day computing machine (Intel(R) Core(TM) 
17-3770K 3.50 GHz CPU and 32 GB RAM). But 
note this is for an application in which the n's do not 
need to be sorted, otherwise the long mns of nnn . . . 
would cause a catastrophic increase in run-time. 

Fortunately, the suffixes of T are not an arbitrary 
collection of strings, but rather have the special prop- 
erty of being nested. It turns out that exploiting this 
property leads to more efficient algorithms, as we 
describe in this article. In section Bird's-eye view, 
we briefly outline the first three linear-time algo- 
rithms for direct suffix array construction: Kim 
etal. [19], Karkkainen and Sanders [20] and Ko and 
Aluru [21] (Theoretically, linear time can be 
achieved by first building a suffix tree and traversing 
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it to compute a suffix array. But as suffix trees are 
memory expensive, this method would largely 
defeat the whole point of making suffix arrays a prac- 
tical replacement for suffix trees.). (Linear time can be 
achieved by first building a suffix tree and traversing it 
to compute a suffix array, but suffix trees are memory 
expensive). Then in section 'A close look at SA-IS', 
we give a more detailed description of SA-IS, a recent 
algorithm proposed by Nong, Zhang and Chan [23, 
24]. SA-IS improves on the method of Ko and Alum, 
making it one of the fastest algorithms available, not 
only theoretically but practically as well. 



Bird's-eye view 

Interestingly, around the same time in 2003, three 
different linear-time methods were proposed inde- 
pendently by Kim etal. [19], Karkkiiinen and Sanders 
[20] and Ko and Aluru [21]. All of them use a similar 
divide-and-conquer (As the solutions lead to a 
single-branch recursion, we could use the more pre- 
cise (but less familiar) term 'decrease and conquer') 
strategy based on the idea that as suffixes are inher- 
ently nested, we should be able to determine the 
lexical order of all suffixes if we knew the order of 
only a select number of them. The general strategy 
can be outlined as follows: 

Divide phase: Given a text T of length n, system- 
atically choose a subset S of the suffixes of T. 
Construct a new text T' of length \S\ in such a 
way that sorting the suffixes of T' is equivalent to 
sorting the S suffixes in the original text T. 

Conquer phase: Recursively construct the suffix 
array of T' . Sorting the suffixes of T' is exactly the 
same problem (suffix array construction) we started 
with — albeit under a different alphabet and on a 
smaller input size. 

Combine phase: From the suffix array of T', com- 
pute the suffix array of T. 

The algorithms vary in their choices of S, which 
impacts many things downstream: the construction 
method and size of T', the terminating point of re- 
cursion, the complexity of the combine phase and 
consequently the running time and memory usage. 
For example, Kim etal. [19] take S to be the set of 
even-indexed suffixes, i.e. 



To., 



T4.. 



IT; 



! IS even 



T' M 



Assuming here for the sake of simplicity that n is 
even, they construct a shorter text T' of length 
n/2 from an alphabet derived from the length two 



Figure 2: Divide phase of the algorithm by Kim et al. 
Here, T=accca$. The set of sampled suffixes 

5 = {accca$ , cca$ , a$ } . 7o..i=ac; 7"2..3=cc; 
74.5 =a$. Since a$<ac<cc, RANK(7o..i)=l, 
RANK(72..3)=2, and RANK(T4..5)=0. Therefore, 

7' = 120. 



substrings (2-mers) in T. More precisely, the ith 
character of T' is defined as: 

T' = KANK(T2i..2i+i). for 0 < i < n/2, 

where RANK maps a substring Tji.^i+i to its rank in 
the lexical ordering of the set of 2-mers appearing at 
even index positions in T. Figure 2 shows an 
example of construction of T' from T = accca$. 

It is not difficult to see that T'- < T'^ 4> 
Toi... < T2/... (0 < ij < n/2), and therefore, we can 
determine the lexical ordering of S from the lexical 
ordering of suffixes of T' . This technique of repla- 
cing substrings in an original text by a single charac- 
ter (in a new alphabet) representing the substrings' 
lexical order is called lexical naming', and is a recur- 
ring theme in this article. The size of T' is half that of 
T, thus reducing the problem size by half in each 
recursion. Unfortunately, the combine phase of the 
algorithm of Kim et al. is extremely complicated. 

In contrast, the algorithm by Karkkainen etal. [20] 
selects the suffixes as follows: 

S = {T,... I ; mod 3 7^ 0}. 

which leads to a simpler divide and combine phase. 
Although we do not describe their algorithm in detail, 
we would like to give some intuition for the selection 
criterion. The key observation is that for any two 
suffix starting positions ij: in at least one pair among 
{(/,;), (i + l,j + 1), {i + 2,j + 2)} neither element is 
an exact multiple of three, and therefore the suffixes 
corresponding to that pair are both in S. Thus, once 
the S suffixes are sorted, the relative ordering of any 
two suffixes can be easily detemiined in constant time. 
Technically one may say that the set {1,2} forms a 
'difference cover' modulo 3, and this strategy can be 
generalized to covers of modulo larger than three, as 
described by Burkhardt and Karkkainen [29]. 
Unfortunately, by the construction of Karkkainen 
[20], T' is two thirds the size of T, leading to 
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computation time and working memory roughly 
proportional to 3|T|. This is not competitive with 
SA-IS described in the next section, which reduces 
the problem size to at most one half in each step and is 
faster and more memory efficient. 

A close look at SA-IS 

Following the divide-and-conquer strategy outlined 
in the previous section, we shall now take a deeper 
look at the SA-IS algorithm by Nong etal. [23, 24]. 
To maintain a balance between a readable descrip- 
tion and a rigorously complete one, we relegate 
some of the proofs to the Supplementary text. 

Divide phase 

Suffix classification and selection Given a text T of 
length n, a suffix T/... is classified as (ascending) 
type if T,... < T,+i... or \ (descending) type if 
T,... > T/+1.... The notation we use here is intended 
to be graphically mnemonic. Equivalently, the type 
of T/... starting with some character, say c, can be 
defined relative to the next character x 7^ c , follow- 
ing T; after a run of zero or more c's. If x > c then 
T,... is , otherwise T/... is \. As a special case, the 
suffix T„_i... consisting of only the sentinel character 
is defined to be The type of each suffix T,... can 
be computed efficiently by scanning T in reverse 
order and applying the following rule. 



When 


classify T, .. as: 


r, < T,+i 


/-type 


r, > T,+i 


\-type 


r, = T,+i 


Same as T.+ i... 



The correctness of the first two conditions is ob- 
vious. The correctness of the third condition follows 
from the observation that if both T; and T,+i hold 
the same character, say c, the pair of suffixes 
(T,...,T/+i...) can be obtained by prepending c 
onto the two suffixes (T,+i..., T,+2 - )- 

A type- suffix T,... is further classified as a -y/ 
(valley) if T/_i... is a \ suffix. It might be worth 
noting that with this definition, T„-\... is always a 

suffix because the sentinel character is always lex- 
ically smaller than its preceding character, and on the 
other hand To..., which has no preceding suffix, is 
not a ^ suffix, even when it happens to be an 
one. From this procedural definition, it is easy to see 
that we can identify the -y/-type suffixes by slightly 
modifying the scan mentioned above. Alternatively, 
the -s/ sufBx positions can be defined in a more 



(a) 

|t|g|t|g|t|g|t|g|c|a|c|c|g|$ 

01 2 3 4 5 6 7 8 9 10 11 12 13 
(b) . . . . 



1 g 1 1 


1 g 1 








T1..3 


1 g 1 


t 1 g 1 








T3..5 


1 g 1 t 1 g 


1 c 1 a 1 










1 ^ 1 


c 1 c 1 g 1 $ 1 








T'9..13 












Tl3..13| 


nl 1 


3 1 


3 1 5 


1 '' 


1 ° 1 



Figure 3: Divide Phase, (a) String J with its suffixes 
classified as \, ^ . (b) Construction of reduced in- 
stance T by lexical naming. 

declarative way, as the local minima of the inverse 
suffix array — the array for which element ; holds the 
sorted order rank of suffix T/.... In Figure 3a, we 
demonstrate the classification of the suffixes of 
T = tgtgtgtgcaccg$. We similarly define 
each character T,- to be of type \ or (and possibly 
also y/) in accordance with the type of suffix T,.... 
The divide phase of Ko and Alum's algorithm selects 
either the set of /'-type or \-type suffixes (which- 
ever is smaller). Ko and Aluru's choice results in a 
simple combine phase (similar to Step 2 of the SA-IS 
algorithm combine phase described later), but a fairly 
cumbersome divide phase. SA-IS uses the main 
idea of the Ko and Aluru combine phase, but selects 
-s/-type suffixes instead, and by doing so achieves 
both simple divide and combine phases. 

We close this section by noting a divide-and- 
conquer strategy does not necessarily imply the use 
of recursion. Many nonlinear, nonrecursive (or only 
partially recursive) algorithms also sort a select set of 
suffices first and then use that information to sort the 
rest. For example, Itoh and Tanaka [30] select the 
suffices T,... for which T/ > T,+i, while the algo- 
rithms of Mori [31] and Maniscalco and Puglisi 
[25] use a suffix-selection strategy almost identical 
to that used by SA-IS. 

Construction of reduced instanceT' Let us now see how 
to construct the reduced instance T' from the ^ 
suffixes. Consider the region in T starting with a 
^ suffix and ending with the next ^ suffix. From 
the above definitions, it is clear that this region con- 
sists of a run of suffixes followed by a run of \ 
suffixes, and finally a single suffix. Again, in the 
hopes of being graphically mnemonic, we denote the 
substrings going from one ^ suffix to the next as 
-v/Xy. (read 'w') substrings. As a special case, 
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T„_i..„_i consisting of the sentinel character is also 
defined to be a substring. 

The ^/\^ substrings divide T into blocks of sub- 
strings with overlap of one character (Figure 3b). 
The ^/\^ substrings are sorted based on the usual 
lexical ordering but with one extra rule: if two char- 
acters are the same, then we next look at their types, 
with defined to be larger than \. For example in 
Figure 3, T5..9 sorts before T3..5 as Tj is \ while T5 
is Z' . These rules allow us to sort the set of ^/\^ 
substrings in T and from that obtain lexical names 
for each ^/\^ substring. T' is obtained by concate- 
nating the lexical names of the -y/\^ substrings in the 
order they appear in T. (Figure 3b). 

The innocent-looking '\ sorts before y rule is 
in fact important. The intuition behind it is that be- 
tween a pair of \-type and /'-type sufExes of T, 
both starting with the same character, the \ sulEx is 
lexically smaller than the y one (Lemma SI). Thus, 
the lexical order of two suffixes of TwiU be correctly 
reflected in the order of their corresponding suffixes 
in T' . We provide a formal proof of this in the 
Supplementary text (Theorem SI). What is perhaps 
more subtle is that this extra rule eliminates the 
proper prefix problem that is inherent with lexical 
naming of variable-length substrings, by telling us if 
the prefix should come before or after the substring 
which contains it (section 'The Proper prefix pro- 
blem' in supplementary material gives an example 
and more formal discussion of this observation). 

At this point, there is one major outstanding issue. 
For this algorithm to achieve a linear run-time, we 
must be able to sort the ^/\^ substrings in linear time. 
While it is easy to sort the substrings in quadratic 
time, it is not straightforward how to accomplish 
this in linear time. Nong et al. found a surprisingly 
simple solution, which is nearly the same as the com- 
bine phase described in section 'Combine phase'. 
For completeness, however, we expHcitly describe 
the linear-time sorting of the substrings in the 
Supplementary text (Section SI. 3). 

Conquer phase 

If there are no ties in the sorting of the sub- 
strings (in other words each lexical name is unique), 
the order of the ^ suffixes can be determined with- 
out the need for further recursion. Otherwise, the 
suffix array of T' is computed recursively. 

Combine phase 

The recursion returns the order of the suffixes of T' , 
which tells us the relative order of the -y/-type 





a 1 


c Jg |t J 




Vtype |/'-type Vtype |Atype Vtype \/--tjpe Vtype 



Figure 4: Buckets of a DNA-string suffix array of 
length n. Gray indicates \-type positions. The bucket 
for 7 does not have a subbucket for / because there 
cannot be any / suffix starting with the lexically great- 
est character of the alphabet. 



suffixes of T. We wish to use this information to 
order all the suffixes of T. 

Even without this new information, we can say a 
few things about the suffix array of T. First, all suffixes 
starting with a given character wiU be in a contiguous 
block. Second, as was mentioned earher, between a 
pair of \-type and /'-type suffixes, both starting 
with the same character, the \ suffix is lexically smal- 
ler than the / one (Lemma SI). Therefore, the suffix 
array of T can be thought of as being partitioned into 
buckets, every bucket holding all the suffixes starting 
with the same character; and each bucket further par- 
titioned into two sub-buckets, one for the \-type 
which is to the left of the one for the /-type sufExes. 
For example, if T is a DNA string, its suffix array can 
be logically partitioned (with some buckets possibly 
empty) as shown in Figure 4. 

To construct the suffix array of T, we start by allo- 
cating an array A the size of T. A wiU eventually end 
up as the suffix array. The procedure can be explained 
in three major steps described below. A running ex- 
ample with the text T from Figure 3 is provided in 
Figures 6 and 7. In the following, we wiU refer to the 
character in the rth position of A as Aj. 

Step 0: This step initializes A. Set all elements of A 
to the special value of —1. Compute the bucket 
boundaries of A, by counting the frequency of 
each character type pair (e.g. a /) in the text. 
Pointers can be used to mark the boundaries — one 
for each \-type bucket pointing to its left end 
(head). (After Step 1 below, these pointers can be 
reused to point to the right end (tail) of each /- 
type bucket). Place the -yZ-type suffixes into the ends 
of their buckets in their sorted order (Figure 5). Note 
that this is not the final resting position of the y/- 
type suffixes. 

Step 1: This step uses the order of -v/-type suffixes to 
sort the \-type suffixes. Scan A from left to right, 
skipping any elements with value — 1 . For each suffix 
index Ai encountered, if T4._i is \, place Aj — l at 
the current head of its respective bucket, and then 
increment that head pointer (Figure 6). 
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a 


C 


g 


t 


1 


9 


-1 1 -1 1 -1 


-1 1 -1 1 5 


3 1 1 


-1 1 -1 1 -1 1 -1 1 



Figure 5: Array A at the end of Step 0 in which the ^ 
suffixes have been placed in their buckets in sorted 
order. Gray indicates \-type positions. This order of ^ 
suffixes is obtained from recursion. 
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Figure 6: Animation of Step I of the combine phase as 
the sweep proceeds from left to right. The original 
text 7 is also shown for reference. The t symbols 
point to the current heads of \-type subbuckets, the 
• symbol shows the current position of the sweep and 
cells with thick boundaries indicate changes. For ex- 
ample, in the topmost row, suffix index 13 is encoun- 
tered; and as 7|2 . is \-type, 12 is inserted at As, the 
current head of the bucket for \-type suffixes starting 
with g. The sweep proceeds accordingly Whenever a 
pointer reaches the edge of its bucket, we change its 
representation to a dashed arrow. From sweep position 
10 onwards, the array does not change and so this ani- 
mation excludes those steps. 



Step 2: This step uses the order of the \-type suf- 
fixes obtained from Step 1 to sort the /'-type suf- 
fixes. Scan A from right to left. For each suffix index 
Aj encountered, if T^4,_i is /'-type, place into 
the current tail of its respective bucket and decre- 
ment that tail pointer (Figure 7). 

At the end of Step 2, A h, exactly the suffix array 
of T! 

Although these steps consist of simple operations, 
the correctness of Steps 1 and 2 might not be imme- 
diately clear. Informally, Step 1 is a clever one-sweep 
implementation of the foUo vising idea. The order of 
two \-type suffixes T,... and Tj..., starting with the 
same initial character, can be determined recursively 
as Tj... < Tj... Ti+\... < Tj^\.... We can end this 
recursion when we reach the smallest integer x such 
that 1) r,+v and Ty+v are different characters or 2) at 
least one of them is of /-type (i.e. starts a -s/-type 
suffix). If exactly one is of /-type suffix, it follows 
immediately that it goes after the other in the suffix 
array, while if both of them start a -yZ-type suffix, 
their order is already known from the conquer phase. 
Step 2 can be understood similarly — it sorts the 
/-type suffixes inductively based on the order of 
\-type suffixes. A formal proof of correctness 
of the combine phase is provided in the 
Supplementary material (Theorems S3 and S4). 

In fact, SA-IS uses an almost-identical 'induced- 
sorting' procedure to sort the substrings in the 
divide phase as well (section 'The Method' in sup- 
plementary material). Finally, we know what the 'IS' 
in SA-IS stands for! 



COMPUTATIONAL COMPLEXITY 
Time complexity 

The time complexity of SA-IS is Unear in the input 
text size. This is because each divide phase results in 
the problem being reduced into a problem of size 
half or even smaller suffixes occur at boundaries 
between \ and / suffixes and therefore at most 1/2 
of all suffixes can be suffixes), and the additional 
work of dividing and combining at each level can be 
performed in linear time. 

Memory usage 

The suffix array of a length-fj text can be stored in 
0(k log «) bits of space — the suffix array holds n 
numbers, and each number can be encoded using 
log n bits. Because the suffix array itself does not 
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Figure 7: Animation of Step 2 of the combine phase as 
the sweep proceeds from right to left. The original text 
7 Is also shown for reference. The f symbols point to 
the current tails of /'-type subbuckets, the • symbol 
shows the current position of the sweep, and cells with 
thick boundaries indicate changes. For example, in the 
topmost row, suffix index 0 Is encountered, and there- 
fore no action needs to be taken. Next, suffix index 2 is 
encountered; and as 7| is /'-type, I is Inserted at A9, 
the current tail of the bucket for /-type suffixes start- 
ing with g. Whenever a pointer reaches the edge of Its 
bucket, we change its representation to a dashed arrow. 
From sweep position 2 onwards, the array does not 
change and so this animation excludes those steps. 



contain the text itself, we also need to load the text 
into main memory to be able to process queries. This 
requires another ©(wlog a) space, where a is the size 
of the alphabet. 



In many bioinformatics applications, the text 
length is shorter than 2^-, allowing each index to 
be represented using 4 bytes. Also, the alphabet 
size is small enough that 1 byte is enough to represent 
each character. This adds up to a total of 5n bytes. 
A haploid human genome contains ~3 biUion bases, 
and therefore it requires a total of 15 GB of memory 
just to hold the text and suffix array. Moreover some 
genomes are 10s or even 100s of times larger than 
that of human, and of course one may want to index 
multiple genomes. Thus minimizing memory use is 
an important concern when using sufEx arrays. 

There are several possible workarounds, some- 
times at the cost of higher query processing times. 
In many cases, the data can be partitioned into logical 
segments (e.g. chromosomes in a genome), and the 
index for each partition can be treated separately. In 
applications where some loss of infomiation is toler- 
able, one can choose to store only a subset of the 
suffixes (e.g. only every second sufEx), an idea 
known as a sparse suffix array [32]. Sparse indexing 
is not unique to sufBx arrays, and in fact is used by 
several sequence alignment tools, e.g. BLAT [33], 
indexed MegaBLAST [34]. More sophisticated, 
sufEx-array-like, reduced memory data structures 
have been developed, including methods involving 
the Burrows— Wheeler Transform [35], FM-index 
[36] and compressed suffix arrays [37]. These meth- 
ods typically reduce memory use at the expense of 
the computation time needed for pattern searches. 
Vyvemian et al. [38] give a comprehensive review 
of the trade-offs offered by these and other indexes. 

Apart from the storage memory of the index itself, 
we also need to consider the working memory 
required by the construction algorithm. This is 
defined as the additional memory required by the 
construction algorithm, excluding the memory 
used to hold the input text and the output suffix 
array. Various 'lightweight' suffix array construction 
algorithms have been proposed (e.g. [39, 40]) which 
achieve reduced working memory. 

The SA-IS algorithm is elegant not only in terms 
of computation time, but also in the way it leads to 
an implementation with most of the working 
memory allocated to bucket pointers. Assuming the 
original text T is from a small fixed alphabet, the 
number of distinct characters in the reduced text 
T' can become as large as \ T\/2, and therefore, the 
bucket pointers for the first level of recursion can 
require almost |T|/2 buckets (often considerably 
less in practice), leading to a working memory of 
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roughly 2\T\ bytes (with 4-byte pointers). 
Surprisingly, no working memory is needed to 
hold the text or suffix arrays computed during recur- 
sion, as these can be computed using the same 
memory that ultimately holds SA . The fact that 
this can be done is not obvious and somewhat 
involved, so we provide a detailed analysis of the 
memory usage of SA-IS in the Supplemental mater- 
ial 'Memory usage of SA-IS'. Intriguingly, Ge Nong 
has recently reported a new sufSx array construction 
algorithm, SACA-K [26], which achieves an 0(1) 
working space for constant alphabet, while maintain- 
ing linear runtime. SACA-K can be understood as a 
variant of SA-IS, modified with clever optimizations 
to eliminate the need for separate memory for 
bucket pointers after the first level of recursion. A 
full description is beyond the scope of this article, but 
a thorough reading of this review should be of great 
help in understanding SACA-K. 

Benchmarking 

To understand the time and memory perfomiance of 
SA-IS in practice, we performed simple experiments 
with several biological data sets that are representa- 
tive of the kind of data usually encountered in 
bioinformatics research. For more comprehensive 
benchmarks on general applications, we refer the 
reader to [22, 24]. 

Data sets 

The data sets used are summarized in Table 1 . The last 
entry in the table requires some explanation. We 
started with chr22 of the human genome. Using 
Dnemulator [41], a package for simulating polymorph- 
isms, we simulated several copies of chr22, as if they 
were coming fi"om different individuals. Dnemulator 
does this by picking real alleles based on their frequen- 
cies as reported in snpl32Common.txt, a SNP data- 
base [42] available fi-om the UCSC Genome Database. 
In this manner, we constructed four different data sets 

lable I: Different biological data sets used for tests 



Data set 



D. melanogaster (fruitfly) genome 165 

G. gallus (chicken) genome 992 

UniProtKB/Swiss-Prot protein data set 193 

UniProt fungi proteins data set 872 

Human chromosome 22 and its copies 36 to 249 



containing 1, 3, 5 and 7 different copies of chr22. This 
data set is relevant to bioinfomiatics because with 
increasing amount of sequence data becoming avail- 
able, it is likely that data sets contain genomic se- 
quences fi-om different individuals of the same species 
and/ or fi-om similar organisms. Another motivation for 
this experiment is to illustrate the problem a nonlinear, 
but usually fast, suffix array constmction algorithm ex- 
hibits when faced with many suffix pairs that share long 
common prefixes. Many biological sequences such as 
genomic sequences contain many long repeats and are 
especially prone to long common prefixes. For algo- 
rithms that directly compare suffixes to sort them, this 
means the comparison takes longer. 

Preprocessing the data sets 

For DNA data, usually the character n appears wher- 
ever the nucleotide at that position has not been 
correctly identified. We removed all occurrences of 
n and reformatted the data files as suitable for each 
software package to represent the biological se- 
quences as a single concatenated text string with a 
unique delimiter character placed between adjacent 
sequences. 

Programs 

We benchmarked several freely available suffix array 
construction programs based on different algorithms, 
including two implementations of SA-IS: one 
available from the authors of SA-IS and an implemen- 
tation by Yuta Mori (https://sites.google.com/site/ 
yuta256/sais). We also tested SACA-K [26], a recently 
published memory-efficient successor of SA-IS and 
included an implementation of the Deep-Shallow al- 
gorithm [40], which is theoretically not a linear-time 
algorithm, but has been shown to be fast and light- 
weight in practice [22]. For each program, we used 
the default parameter settings. Finally, for baseline 
comparison, we included our implementation of 



ftp://ftp.ensembl.org/pub/release-73/fasta/drosophilajinelanogaster/dna/ 

ftp://ftp.ensembl.org/pub/release-73/fasta/gallus.gallus/dna/ 
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http://hgdownload.cse.ucsc.edu/goldenPath/hgl9/bigZips/ and simulation 
(see text) 
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radix sort. The links to the programs tested here are 
available in the Supplementary material. 

Results 

We ran the programs on a machine Intel(R) 
Core(TM) i7-3770K 3.50 GHz CPU and 32 GB 
RAM with a linux-based OS installed. We measured 
running time using the usr + sys time reported 
by the Linux time command, and peak memory 
usage using the Linux top command. The running 
time and peak total memory usage of each program 
for each data set is shown in Figure 8. 

The SA-IS implementation by Mori shows superior 
time and memory performance across all data sets. 
However, when compared across implementations 
from the same group (Nong etal), SACA-K outper- 
forms SA-IS in both time and memory. Compared 
with Mori's implementation of SA-IS, Deep Shallow 
perfomis significantly poorly for the chr22 data set, but 
it is fairly competitive for the other data sets. Radix 
sort is slow for the chr22 data set, as one might expect. 
For the remainder of the data sets, it is not far off from 
the other methods. However, we should note that the 
implementation of radix sort we used has an applica- 
tion specific advantage in that it does not fuUy sort the 



suflixes, but instead only sorts up to the first delimiter 
(i.e. does not sort past the boundaries of the biological 
sequences fomiing the input text). 



INEXACT PATTERN MATCHING 

With biological sequences, we are not always look- 
ing for exact matches. As a simple example, consider 
searching for a given string in a protein-coding DNA 
sequence. Protein-coding DNA tends to exhibit sub- 
stitution at the third position of every codon, as this 
often does not affect the encoded anuno acid. We 
could therefore relax our pattern-matching require- 
ments by allowing a mismatch at every third position 
of the pattern, for example with a text T = actcg 
tact, the substring T()..5 would be a match for the 
query pattern acgcga. 

Approximate matching comes in different flavors, 
necessitating appropriate modification to ordinary 
suffix arrays. Here we deal with three kinds of ap- 
proximations: spaced seeds that allow any mis- 
matches at predetemiined positions, subset seeds 
that allow only certain kinds of mismatches at pre- 
determined positions and finally matches that are 
within a prescribed edit/Hamming distance. 
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148 



Shrestha et al. 



Spaced seeds: patterns with don't-care 
positions 

The concept of 'spaced seeds' is widely used in 
pairwise sequence alignment algorithms that use 
BLAST-Hke seed-and-extend techniques. Given two 
sequences Ti and T2, these alignment algorithms first 
identify potentially similar regions using 'seeds', short 
strings that can be found in both and T2. Originally 
seeds were required to match exactly, but it has since 
been shown that the sensitivity of these alignment al- 
gorithms increases significantly when spaced seeds that 
allow mismatches at certain positions are used [43—45]. 
This has resulted in a host of sequence alignment tools 
that rely on the concept of spaced seeds (for example, 
see [6, 44, 46, 47]). Spaced seeds have also been 
applied to the problem of correcting errors in reads 
from high-throughput sequencers [48]. 

Given the tremendous interest in spaced seeds, it is 
desirable to have a suffix array-like data structure that 
when constructed for a database string and a set of 
don't-care positions facilitates rapid pattern searches. 
We devote the rest of this section to describing the 
construction of such an index called the 'spaced 
suffix array'. First, let us start with some mathemat- 
ical definitions. 



Definitions. . .again 

The don't-care positions of a spaced seed can be 
described by a 'mask' M, a binary vector represented 
by a string over the alphabet {0,1}, with the 0 pos- 
itions of M corresponding to the don't-care pos- 
itions. Applying M to a length- |M| substring T, y 
of a string T results in a 'masked substring' Wi j, 
which is the string obtained fi-om T, j by replacing 
Tj+k for each = 0 (i.e. the characters corres- 
ponding to the don't-care positions) by a unique 
fixed character. As an example, let 
T = cagctat$, M = 101 and * be the replace- 
ment character, then applying M to a substring of T, 
say Tj 3 = age, we get the masked substring 
yi..3 = a*c. 

The definition of masked substring can be ex- 
tended to substrings that are not of length |M| in 
the following manner. If T, y is shorter than M, we 
apply only the prefix of M which has the same 
length as Ti j. If T, j is longer than M, we apply 
the mask cyclically as many times as required, so 
for example the mask 101 can be thought of as 
101101101 • • •. Using this mask and the same T 
as before, ^o..! = c* and F()..4 = c * gc*. 



a g c t a t $ 

a t $ 

c a g c t a t i 

c t a t $ 

get a t $ 
t $ 

t a t $ 
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a * c t * t $ 
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c* gc* at* 

g * t a * $ 
t * 

t * t $ 



Figure 9: Contrasting the ordinary suffix array (left) 
of cagctat$ with its spaced suffix array under mask 
101 (right). The characters at don't-care positions have 
been replaced by *. 



This definition naturally extends to the concept of 
a 'masked suffix' of a suffix T,..., which we shall 
denote by F/.... The 'spaced suffix array' of T 
under mask M is a list of the masked suffixes of T 
sorted in their increasing lexical ordering. 
Depending on the mask and text, a spaced suffix 
array is in general different from an ordinary suffix 
array of the same text. Figure 9 shows one such 
example with T = cage tat $ and M = 101. 

Query processing in a spaced suffix array 

Because a spaced suffix array is constructed with a 
predetemiined don't-care pattern in mind, naturally 
the query string must also be processed under the 
same pattern. Apart from this, processing queries 
in a spaced suffix array is similar to what is done 
with ordinary suffix arrays, a topic we discussed in 
section 'Query string search'. 

Constructing a spaced suffix array 

Given a text T and a mask M, a straightforward so- 
lution to computing the spaced suffix array of T 
under mask M is to use radix sort that skips the 
don't-care positions. As we have seen in section 
'Benchmarking', radix sort can become slow for cer- 
tain inputs, and therefore faster solutions are desir- 
able. Horton et al. [27] describe a method called 
'DisLex', which uses lexical naming to transform 
the input text into a new 'DisLex text' (over a 
new alphabet), such that the desired spaced suffix 
array can be easHy derived from the 'ordinary' 
suffix array of the DisLex text. This method con- 
structs a masked suffix array in three steps: 

Step 1: Transform T to a new text T' of similar 
length such that sorting the masked suffixes of T is 
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equivalent to sorting the ordinary suffixes of T'. The 
core idea uses lexical naming, a technique we came 
across while discussing suffix array construction algo- 
rithms (e.g. Figures 2 and 3). 

Step 2: Apply any linear-time suffix array construc- 
tion algorithm on T' . 

Step 3: Reverse-transform the suffix array of T' to 
obtain the spaced suffix array of T. 

In Figure 10, we illustrate steps 1 and 2 with the 
running example of T = atggacgacact with 
M = 101. 

TransformingT to T' Assume T does not contain any 
sentinel character, and for the sake of simplicity, sup- 
pose the length of T is some multiple k of the mask 
length m = |M|. Append at the end of T a string 
$o$i • • • $2hi-2 of 2m— I sentinel characters (if T is 
not an exact multiple of m, extra sentinel characters 
can be added to make it so). The extra characters are 
lexically smaller than any character of T, and among 
themselves are related as $0 > $i > $2 > • • • 
> $2m-2- The relevance of this padding wiU be ap- 
parent later. From here onwards, T refers to this 
padded string of length of in{k + 2) — 1 . 

Take all distinct length-w masked substrings of T 
and compute their lexical ordering, for example by 
radix sorting. Let us define a mapping RANK that 
maps each distinct masked substring to its rank in this 
ordering (Figure 10b). 

Next construct a string O,„od«i from T()..(t,+i),„_i by 
replacing the length-m length substrings T,„, (,+i),„_i, 
0<i<k by RANK(?-,.„ (,+i),„_i) (Figure lOc). 
The reason behind this rather peculiar naming is 
that the suffixes of On,odm correspond to those 
masked sufExes of T whose index position has 
the property that ;' mod m = 0. In a similar fashion 
of lexically naming length-m substrings, construct 
string l,„odm from Ti„(k+iy„, string 2„^odm from 
T2..{k+i)m+i, and so on up to (w-l),^^^,,, from 
T,„_i {fe+2)m-2 (Figure lOd, e). Finally, concatenate 
Omod;H> Imodm, ... , (ft! - l)„oa„, ^o obtain the DisLex 
text T' (Figure lOf). 

Reverse transformation Step 2 produces the ordinary 
suffix array of T'. The spaced suffix array of T can 
be easily computed using the one-to-one corres- 
pondence between the suffixes of T' and the 
masked suffixes of T. Let us see how to describe 
this correspondence in more mathematical terms. 

Consider a suffix T. of T' . If we think of T' in 
terms of the m blocks Omodm to {m — l)niod,„ each 
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Figure 10: A demonstration of how DisLex constructs 
a spaced suffix array using an example string T = atg 
gacgacac$ and mask M=101. The characters at 
the 0 positions of the mask have been mapped to the 
character *. (a) The input string with extra 
padding, (b) Lexically sorting all the length-3 distinct sub- 
strings of I The mapping RANK is defined using this 
ordering, (c), (d), (e) Constructing O^odB. lmod3 and 
2mod3. respectively (f) Constructing V by concatenating 
Omod3. I mod3 and 2mod3- (g) The suffix array of T (above) 
is transformed to the spaced suffix array of T (below). 



k + 1 long, then the position index / can be expressed 
as / = x{k + 1) + y for some integer x {0 < x < m) 
and some integer y (0 < y < k) — the value x indi- 
cates which block T'^ starts in and y teUs us its 
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position offset within this block. Then the index of 
the corresponding masked suffix of T is ym + x. For 
example in (Figure 10(f,g)), 9 in the suiEx array of T' 
corresponds to 13 = ym + x = 4*3+l in the 
masked suffix array of T. In this computation, 
X = I because 9 is in the modi block of T' 
(L9/(fe + 1)J = L9/5J = 1), y = 4 because 9 mod 
5 = 4. 

Correctness Why does sorting suffixes of T' corres- 
pond to sorting masked suffixes of T? It again helps 
to think of T' as a concatenation of the blocks of 
strings O,nod„, to (m-l),^^^,,,. Consider a pair of 
masked suffixes F^-- and ¥y... of T. Suppose 
X mod m = / and y mod in = j (with the possibility 
that i=j), then the suffix of T' corresponding to F^... 
starts in the imodm block, and that corresponding to 
¥y... Starts in the /mod™ block. If we look at just the 
strings imodm and /mod/ii (i-C isolated from T'), then 
we can see that our lexical naming technique ensures 
that the lexical relation between Wx... and "Fy... is 
exactly the same as that between their corresponding 
suffixes in I'modm andjmodm- However, in T' , there 
could be other characters following ('modm and jn^odm- 
This is where the role of the padding comes in. We 
padded T with enough sentinel characters so that the 
last character of each block receives a delimiter-type 
lexical name (lexically less than any name not invol- 
ving sentinel characters). Therefore, it does not 
matter that in T' , blocks /mod™ and jmodm are fol- 
lowed by other characters. 

Computation time The running time of the DisLex 
transformation is 0(|T| x |M|) time. In practice, 
|M| <JC |T| and the time needed for the DisLex 
transformation and reverse transformation is much 
less than that needed for the subsequent ordinary 
suffix array construction in Step 2. The results of 
simple experiments with human Chromosome 1 
and two masks are shown in Table 2. 

The main practical drawback to DisLex is that, 
depending on the mask, the alphabet size of the 
DisLex text may in general become quite large 
(although always bounded by |T|), even when the 
original alphabet size is small. The 'codon mask' 101 
is relatively innocuous in this respect, as it has only 
two care positions and therefore at most squares the 
original alphabet size (ignoring the small overhead 
due to sentinel characters). Many highly sensitive 
seeds are, however, much longer and contain many 
more care positions, which can result in a large 
alphabet. Recalling the issues raised in section 



lable 2: Running time (in seconds) of the three steps of 
DisLex with human Chromosome I (~225 million charac- 
ters) as input, using the 'codon mask' (101) and a mask 
used by PatternHunter [44] (111010010100110111) 



Mask 


Time i 


in seconds 






Step 1 


Step 2 


Step 3 


101 


2 


123 


2 


PaUemHunter 


4 


213 


2 



The suffix array of the LexText in step 2 is constructed with SA-IS, using 
code by P.H. based on Ge Nong's SA-IS implementation. 



'Computational Complexity' and analyzed in depth 
in section 'Memory usage of SA-IS' in the supple- 
mentary material, we can see that an increased alpha- 
bet size increases memory use in two ways: (i) 
depending on the mask, the DisLex text may require 
4 bytes per character, instead of one; and (ii) when 
SA-IS is used as the suffix array construction algo- 
rithm in Step 2, a large number of bucket pointers 
may be needed to induced-sort T. 

Subset seeds 

The concept of subset seeds is a generalization of 
spaced seeds. With spaced seeds, any kind of mis- 
matches are allowed at 0 positions of the mask. 
With subset seeds, we can specify the types of mis- 
matches that are allowed at each position. 

For instance, at some positions we might want to 
make no distinction between the two purines a , g 
or between the two pyrimidines c , t — the rationale 
being that transition mutations have a higher fre- 
quency than transversions. For protein sequences, 
we might want to allow mismatches between similar 
amino acids. It has been shown that a carefully 
chosen pattern of subset seeds is even more effective 
than spaced seeds in improving the sensitivity of 
alignment programs [49, 50]. 

DNA methylation measurement via bisulfite 
sequencing is an important application in which 
the need for subset seeds is especially clear. DNA 
methylation is an epigenetic modification in which 
a methyl group is chemically added to a nucleotide 
(typically cytosine) by cellular methyltransferases. 
This phenomenon is of great interest because it 
plays an important role in gene expression and cel- 
lular differentiation. Treating DNA with bisulfite 
converts the unmethylated cytosines into uracils, 
but leaves methylated cytosines unchanged. In a 
subsequent step, PCR (in which uracil acts like 
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thymine) is used to amplify those sequences. Thus, in 
bisulfite sequencing methylated cytosines appear as 
t's. If we are to then use a seed-and-extend 
method to align these sequences to a reference 
genome, we need to use subset seeds that allow a 
c-t mismatch (g-a mismatch in the reverse strand) 
to account for the fact that a t in the query could 
have possibly originally been a c. 

Like spaced seeds, we can describe a subset-seed 
pattern using a mask. UnHke a binary string mask 
used to represent spaced seeds, however, a length-m 
mask for a subset seed is an m-tuple 
(M(),Mi, . . . ,M,„_i), where each M,- is a collection 
of disjoint sets, each set defining an equivalence class 
of characters for a particular position. For instance, 
({{a,g}, {c,t}}, {{a,c,g,t}}, {{a}, 
{c}, {g}, {t}}) allows a,g-mismatch and c,t- 
mismatch in the first position, any kind of mismatch 
in the second and only exact matches in the third. 
Thus, applying a subset seed mask to a string can be 
thought of as replacing equivalent characters at each 
position by some fixed character. As with spaced 
seeds, subset seeds are applied cyclically or trimmed 
to adjust to the length of the string being masked. 

To facilitate subset-seed queries, we would like to 
sort the suffixes of a given text under a given mask. 
DisLex can easily accommodate subset seeds with a 
bit of modification in Step 1. When sorting the 
length-m distinct substrings to compute the mapping 
RANK, we must apply the subset seed mask. 

Approximate patterns based on 
edit/Hamming distance 

Another type of string search problem is one in 
which given a string T, a short query string P 
and a positive integer k, we wish to find the oc- 
currences of strings in T that are within Hamming 
or edit distance k from P. This formulation has 
applications for instance in mapping short reads ob- 
tained from high-throughput sequencing experi- 
ments to a reference genome. Modern-day 
sequencers work by first shearing the biological 
sample into fragments, and then sequencing the 
fragments. This results in billions of short DNA 
sequences (often called 'reads'), which are typically 
35 to a few 100 base pairs long depending on the 
technology. Often the first step in analyzing the 
enormous set of reads is to align each read to a 
reference genome, a task often called 'mapping'). 
For shorter reads of length up to a 100 nucleotides, 
one popular strategy to tackle the mapping problem 



has been to index the reference genome, and for 
each read search the index to find the locations that 
are within a certain edit distance from the read. 
The edit distance threshold accounts for either 
genuine differences in the sample and the reference 
due to polymorphisms or differences due to se- 
quencer errors. Tools like BWA [51], Bowtie 
[52], SOAP [53], GEM [54] and Masai [55] use 
this model of sequence similarity and employ 
index structures closely related to suffix arrays. 

Theoretically suffix arrays are poorly suited for 
such queries. One approach is to generate the set 
of all strings that are within edit distance fe of a pat- 
tern P and then search each of the strings in the suffix 
array, but this does not scale well because the set 
grows exponentially in the length of P and in k. 

Another strategy is 'backtracking' in which the 
index is parsimoniously traversed to extract all loca- 
tions of approximate matches [56]. The method also 
does not scale well for large k, but can be effective 
for read mapping [55]. 

Yet another strategy is to first filter out the loca- 
tions that are highly likely to contain matches by 
splitting P into shorter substrings and finding exact 
matches of the substrings. To get a sense of why this 
works, consider searching for occurrences that are 
within edit distance 1 of P. If P is partitioned into 
nonoverlapping substrings Pj and P2, it is necessary 
that either Pj or P2 have an 'exact' match at any 
location that is within edit distance 1 . Once the can- 
didate locations are identified, we can then investi- 
gate them to verify if they actually contain the 
required matches. Again this technique performs 
poorly as k increases because the size of the partitions 
start getting smaller and there are too many candidate 
locations that need to be verified. A modified 
suffix array that facUitates faster identification of can- 
didate locations of approximate matches is described 
m [57]. 

We refer the readers to [58, 59] for more detailed 
discussion on approximate string matching based on 
edit distances. 



CLOSING REMARKS 

We have presented a detailed exposition of the 
SA-IS algorithm for suffix array construction, in 
the context of other hnear-time algorithms and 
described techniques to adapt suffix arrays to inexact 
matching needed for bioinformatic applications. 
Although our exposition is by no means light 
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reading, we believe it is considerably more accessible 
than the original literature. 

To provide a reasonably complete and self-con- 
tained document, we intentionally limited the scope 
of our discussion. Thus, we chose not to treat many 
important topics, in particular distributed computing 
and storage of suiEx arrays. However, just as the 
SA-IS algorithm builds on previous techniques, 
future algorithms wiU most likely build on tech- 
niques, such as lexical naming, reviewed in this 
article. 

The field of genomics is rapidly developing as 
sequencing technology continues to advance and 
sequencing is applied to more and more areas outside 
of traditional genome sequencing, from gene expres- 
sion to epigenetics and even quantitative measure- 
ment of protein translation. Patient sequencing and 
analysis may soon become as routine as X-rays for 
medical diagnosis. At the same time, computational 
technology is also leaping forward as our field moves 
towards further adoption of cloud technology and 
data storage moves from peta-bytes to exa-bytes 
and even zetta-bytes. 

For an article written in such exciting times, this 
may seem like a rather stodgy exposition, fastidiously 
counting every computational operation and byte of 
memory used. However, we would argue that the 
coming deluge of sequence data will not be con- 
quered by hardware investment alone, but rather 
existing and novel efficient algorithms will likely con- 
tinue to be the (unsung) heroes — instrumental in 
realizing the potential of the sequencing revolution. 

SUPPLEMENTARY DATA 

Supplementary data are available online at http:// 
bib . oxfordj ournals .org/ . 
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